Conversation

@noalimoy
Summary

This PR adds comprehensive Kubernetes deployment support for llm-katan, enabling multi-instance deployments with model aliasing capabilities.

Kubernetes Manifests (Kustomize-based)

  • Base deployment with security contexts and health probes
  • PersistentVolumeClaim (5Gi) for efficient model caching
  • Service (ClusterIP) exposing port 8000
  • Namespace isolation (llm-katan-system)
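A base Deployment along these lines ties the pieces together (an illustrative sketch: the image name, probe endpoint, and mount path are assumptions, not taken from this PR):

```yaml
# deploy/kubernetes/base/deployment.yaml (illustrative sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-katan
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-katan
  template:
    metadata:
      labels:
        app: llm-katan
    spec:
      securityContext:
        runAsNonRoot: true          # hardened pod-level security context
      containers:
        - name: llm-katan
          image: llm-katan:latest   # assumed image name
          ports:
            - containerPort: 8000   # matches the ClusterIP Service port
          readinessProbe:
            httpGet:
              path: /health         # assumed probe endpoint
              port: 8000
          volumeMounts:
            - name: models
              mountPath: /models    # assumed cache mount path
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: llm-katan-models
```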

Multi-Instance Support (Overlays)

  • gpt35 overlay: Serves gpt-3.5-turbo alias
  • claude overlay: Serves claude-3-haiku-20240307 alias
  • Isolated PVCs per instance (prevents ReadWriteOnce conflicts)
  • Common labels component for consistent resource labeling
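Each overlay only needs to rename resources and override the served alias; a sketch of what the gpt35 overlay's kustomization could look like (the patch mechanism and env var wiring shown here are assumptions):

```yaml
# deploy/kubernetes/overlays/gpt35/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: llm-katan-system
nameSuffix: -gpt35            # yields llm-katan-gpt35, llm-katan-models-gpt35, ...
resources:
  - ../../base
components:
  - ../../components/common   # shared labels component
patches:
  - target:
      kind: Deployment
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: YLLM_SERVED_MODEL_NAME
          value: gpt-3.5-turbo
```

The claude overlay would be identical apart from the suffix and the alias value, which is what makes adding further aliases cheap.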

Model Caching Optimization

  • InitContainer (model-downloader) pre-downloads models to PVC
  • Smart caching: Skips download if model exists
  • Uses python:3.11-slim with `hf download` for a lightweight (~45 MB) init image
  • Main container starts instantly with cached model
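The caching initContainer can be sketched roughly as follows (the model repo, cache directory, and install step are assumptions for illustration):

```yaml
# Illustrative initContainer sketch for the base Deployment
initContainers:
  - name: model-downloader
    image: python:3.11-slim
    command: ["sh", "-c"]
    args:
      - |
        # Skip the download entirely when the PVC already holds the model
        if [ -d /models/Qwen--Qwen3-0.6B ]; then    # assumed cache directory
          echo "Model already cached, skipping download"
        else
          pip install --no-cache-dir "huggingface_hub[cli]" &&
          hf download Qwen/Qwen3-0.6B --local-dir /models/Qwen--Qwen3-0.6B
        fi
    volumeMounts:
      - name: models
        mountPath: /models
```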

Bug Fix (config.py)

  • Added YLLM_SERVED_MODEL_NAME environment variable support
  • Previously only worked via CLI arguments
  • Now enables Kubernetes env-based configuration
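The fix amounts to falling back to the environment when no CLI value is given; a minimal sketch of the idea (field names and defaults are illustrative, not the exact config.py layout):

```python
import os
from dataclasses import dataclass, field


@dataclass
class ServerConfig:
    """Illustrative config sketch; field names are assumptions."""

    model_name: str = "Qwen/Qwen3-0.6B"
    # Read the served alias from the environment when present, so Kubernetes
    # can configure it with an env var instead of CLI arguments.
    served_model_name: str = field(
        default_factory=lambda: os.environ.get("YLLM_SERVED_MODEL_NAME", "")
    )

    def __post_init__(self):
        # Fall back to the real model name when no alias is configured
        if not self.served_model_name:
            self.served_model_name = self.model_name
```

With this shape, `YLLM_SERVED_MODEL_NAME=gpt-3.5-turbo` in the pod spec is enough to change the name the API reports, no CLI flags required.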

Documentation

  • Comprehensive deployment guide (deploy/docs/README.md)
  • Architecture explanation (Pod structure, storage, networking)
  • Kind cluster setup examples
  • Troubleshooting section with common issues
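Deployment follows the usual Kustomize flow; a sketch of the kind of commands the guide covers (the cluster name and overlay path are assumptions):

```shell
# Create a local Kind cluster (name is illustrative)
kind create cluster --name llm-katan

# Apply an overlay with Kustomize support built into kubectl
kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35

# Watch the pods come up in the dedicated namespace
kubectl get pods -n llm-katan-system -w
```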

Test Results

Deployment Validation (Kind Cluster)

Resources Created:

  • Namespace: llm-katan-system
  • Deployments: llm-katan-gpt35, llm-katan-claude (both 1/1 Running)
  • Services: llm-katan-gpt35, llm-katan-claude (ClusterIP, port 8000)
  • PVCs: llm-katan-models-gpt35, llm-katan-models-claude (both 5Gi Bound)

API Validation:
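Against a port-forwarded service, validation would look something like this (hypothetical commands; the actual responses were not included here):

```shell
# Forward the gpt35 service locally (service name from the resources above)
kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 &

# The /v1/models endpoint should report the served alias (gpt-3.5-turbo),
# not the backing model
curl -s http://localhost:8000/v1/models
```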

Motivation

This implementation addresses the need for:

  • Cloud-native deployments: Production-ready Kubernetes manifests
  • Multi-instance testing: Run multiple model aliases simultaneously
  • Efficient resource usage: Model caching prevents redundant downloads
  • Testing flexibility: Easy overlay creation for new model aliases

The Kustomize structure enables:

  • Consistent base configuration
  • Environment-specific customization via overlays
  • Easy addition of new model aliases without base changes

Related issue: #278

- Add comprehensive Kustomize manifests (base + overlays for gpt35/claude)
- Implement initContainer for efficient model caching using PVC
- Fix config.py to read YLLM_SERVED_MODEL_NAME from environment variables
- Add deployment documentation with examples for Kind cluster / Minikube

This enables running multiple llm-katan instances in Kubernetes, each
serving different model aliases while sharing the same underlying model.
The overlays (gpt35, claude) demonstrate multi-instance deployments where
each instance exposes a different served model name (e.g., gpt-3.5-turbo,
claude-3-haiku-20240307) via the API.

The served model name now works via environment variables, enabling
Kubernetes deployments to expose different model names via the API.

Signed-off-by: Noa Limoy <[email protected]>
@netlify
netlify bot commented Nov 20, 2025

Deploy Preview for vllm-semantic-router ready!

🔨 Latest commit: 04e7542
🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/691fa77afa818c0008140a9c
😎 Deploy Preview: https://deploy-preview-710--vllm-semantic-router.netlify.app

@github-actions
👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 e2e-tests

Owners: @yossiovadia
Files changed:

  • e2e-tests/llm-katan/deploy/docs/README.md
  • e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml
  • e2e-tests/llm-katan/llm_katan/config.py

vLLM

🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.
